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Abstract 

An important step in developing individualized treatment strategies is to correctly iden¬ 
tify subgroups of a heterogeneous population, so that specihc treatment can be given to 
each subgroup. In this paper, we consider the situation with samples drawn from a pop¬ 
ulation consisting of subgroups with different means, along with certain covariates. We 
propose a penalized approach for subgroup analysis based on a regression model, in which 
heterogeneity is driven by unobserved latent factors and thus can be represented by using 
subject-specihc intercepts. We apply concave penalty functions to pairwise differences of 
the intercepts. This procedure automatically divides the observations into subgroups. We 
develop an alternating direction method of multipliers algorithm with concave penalties to 
implement the proposed approach and demonstrate its convergence. We also establish the 
theoretical properties of our proposed estimator and determine the order requirement of the 
minimal difference of signals between groups in order to recover them. These results provide 
a sound basis for making statistical inference in subgroup analysis. Our proposed method is 
further illustrated by simulation studies and analysis of the Cleveland heart disease dataset. 
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1 Introduction 


Personalized medicine has gained mnch attention in the past decade, which emphasizes the 
nse of information available on individnal patients to make treatment decisions. Develop¬ 
ing individnalized treatment strategies reqnires sophisticated analytic tools. One of the key 
statistical challenges is to correctly identify snbgronps from a heterogeneons popnlation, so 
that specihc medical therapies can be given to each snbgronp. A popnlar method for ana¬ 
lyzing data from a heterogeneons popnlation is to view data as coming from a mixtnre of 
snbg ronps with their own set s of parameter valnes and then nse hnite mixtnre model anal¬ 


ysis (lEveritt and Hand! . Il98ll ) . 
clnstering and clas sihc ation; see 


Shen and He 


'he mixtnre model approac 


Banheld and RaftervI fll993h . 


las been widely nsed 


Hastie and Tibshirani 


or data 


( 119961 ), 


McNicholasI f2010h and IWei and KosorokI fl2013l ) for the Gaussian mixt ure model approaches. 


(120151 ) for a logistic-normal mixture model method, and IChaganty and Liang 


(120131 ) for a low-rank method for mixtures of linear regressions which provides a good initial¬ 


ization for the EM algorithm typically used in estimation of mixture models. The mixture 
model-based approach as a supervised clustering method needs to specify an underlying dis¬ 
tribution for the data, and it also requires specifying the number of mixture components in 
the population which is often difficult to do in practice. 

In this paper, we propose a new approach to automatically detecting and identifying 
homogeneous subgroups based on a concave pairwise fusion penalty without the knowledge 
of an a priori classihcation or a natural basis of separating a sample into subsets. Let 
Hi be the response variable for the subject. After adjusting for the effects of a set of 
covariates Xj = (x^i, ..., Xip)"*", we consider subgroup analysis for y = (|/i, with 
the heterogeneity driven by unknown or unobserved latent factors, which can be modeled 
through subject-specihc intercepts in regression. Hence, we consider 


l/i = /ij = 1 , ... ,n, ( 1 ) 

where /ids are unknown subject-specihc intercepts, f3= (/3i,..., is the vector of unknown 
coefficients for the covariates x*, and e* is the error term independent of Xj with E{ei) = 0 
and Var(ej) = For example, in biomedical studies, Ui can be certain phenotype associated 
with some disease such as the maximal heart rate which is related to cardiac mortality or 
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body mass index associated with obesity, and x* is a set of observed covariates such as 
gender, age, race, etc. After adjusting for the effects of the covariates, the distribution of 
the response is still heterogeneous, as demonstrated by multiple modes in the density plot 
shown in Figure [5] for our heart disease application. This heterogeneity can be caused by 
unobserved latent factors, so that it is modeled through the subject-specific /ij’s. 

It is worth noting that if the factors contributing to this heterogeneity, for example, 
different treatments, become available, then /ij can be written as /i* = fi + zJO, where Zj are 
the observed variables for the treatments and 0 are the coefficients of Zj. One interesting 
application in personalized medicine is that the coefficients for Zj can be subject-specific, 
since the same treatment may have different effects on patients. For this case, we can 
consider the model with heterogeneous effects of some covariates given as 

l/j =/i-h z^^i-Fx7/3-Fei,i = 1, ... , 77 ,. (2) 


Throughout this paper, we focus on studying model ([T]) by considering that the hetero¬ 
geneity comes from unobserved latent factors. However, our proposed estimation method 
and the associated theoretical properties for model ([I]) can be extended to model ([2]) with 
some modihcations. We provide the detailed estimation procedure for model (Ej) in Section 
A.4 of the Supplemental Materials for interested readers. Assumptions of the structure are 
needed in order to estimate model ([T]). To this end, we assume that y = (t/i, ..., Hn)^ are 
from K different groups with K >1 and the data from the same group have the same inter¬ 
cept. In other words, let Q = ..., Qk) be a partition of {1,..., n}. We have fii = for 

all i E Qk, where is the common value for the /ij’s from group Qk- In practice, the number 
of groups K is unknown. However, it is usually reasonable to assume that K is much smaller 
than n. Our goal is to estimate K and identify the subgroups. We are also interested in es¬ 
timating the intercepts (ai,..., ax) and the regression parameter (3. We propose a concave 


pairwise fusion penalized least squares a pproach for 


his p urpose and derive an alternating 


direction method of multipliers (ADMM, iBoyd et all (120111 )) algorithm for implementing the 
proposed approach. 

Several authors have studied the problem of exploring homogeneity effects of covariates 
in the regression setting by assuming that the true coefficients are divided into a few clus- 
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ters with common values. For instance, iTibshirani et all (120051) proposed the fused LASSO 


method which applies Li penalties to th e pairs of adjacent coordin ates given that a com¬ 


plete ordering of covariates is available. 


Bondell and Reich 


( 20081 ) proposed the OSCAR 


method where a s pecial octagonal shrinkage penalty is applied to each pair of coordinates. 


Shen and Huand (120101) develo ped a group pu rsuit approach with truncated Li penalties 


to the pairwise differences, and iKe et all (120131 ) proposed a method called CARDS. All the 


above methods are about estimating homogeneity effects of co variates, w 


from our work aiming to identify subgroups of the observations. 


Guo et al 


rich is different 


20101) proposed 


using a pairwise Li fusion penalty for identifying informative variable in the context of 
Gau ssian model-bas e d clu ster analysis. In the unsupervised learning setting, a recent pa¬ 
per (iGhi and Langel (120011) ) considered the convex clustering problem and investigated the 
ADMM and the alternating minimization algorithms with the convex Lp {p > 1) penalties 
applied to the pairwise differences of the data points. 



can shrink some pairwise differences of the parameter estimates to zero. However, the Li 
penalty generates large biases of the estimates in each iteration of the algorithm. As a re¬ 
sult, it m ay not be able to iden tify the subgroups, as illustrated in Figure [H To address 


this issue. 


Ghi and Langel (120011) propose to multiply nonnegative weights to the Li norms 


to reduce the bias. However, the choice of the weights can dramatically affect the quality 
of the clustering solution, and there is no clear rule for how to choose the weights. Thus, a 
penalty which can produce unbiased estimates is more desirable for identifying subgroups. 
We propose an ADMM algorithm by using concave pairwise fusion penalties for estimation of 
model ([T]). The concave penalties in the optimization p roblem such as the smoothly clipped 
absolut e devia t ions p enalty (SGAD, iFan and Lil (120011 )) and the minimax concave penalty 
(MGP, IZhand (120101 )1 enjoy the unbiasedness property. We then derive the convergence 
properties of the ADMM algorithm. Moreover, we provide theoretical analysis of the pro¬ 
posed estimators. Specihcally, we derive the order requirement of the minimum difference of 
signals between groups in order to identify the true subgroups. We also establish the oracle 
property that under mild regularity conditions the oracle estimator is a local minimizer of 
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the objective function with a high probability. The oracle estimator is obtained from least 
squares regression by assuming that the true group structure is known. 

The rest of this paper is organized as follows. In Section [2] we describe the proposed 
approach in detail. In Section [3] we derive an ADMM algorithm with concave penalties. We 
then state the theoretical properties of the proposed approach in Section 01 In Sections [5] 
we evaluate the finite sample properties of the proposed procedures via simulation studies. 
Section [6] illustrates the proposed method through a data example. Some concluding remarks 
are given in Section [3 The estimation procedure for model ([2]) and all the technical proofs 
are provided in the on-line Supplemental Materials. 


2 Subgroup analysis via concave pairwise fusion 


For estimation of model o. we propose a concave pairwise fusion penalized least squares 
approach. The objective function is 


J. ^ rp 2 % ^ 

Qn(/^,/3;A) = -> (?/,-/ii-x./3) 

^ % —X 


(3) 


where pi = (pi,..., and p{-, A) is a concave penalty function with a tuning parameter 
A > 0. 

For a given A > 0, define 


(m(A), 3(A)) = argmin^^^ /3; A). 


The penalty shrinks some of the pairs pj — pk to zero. Based on this, we can partition the 
sample into subgroups. Specifically, let A be the value of the tuning parameter selected based 
on a data-driven procedure such as the BIG. For simplicity, write (m,/ 3) = (^(A),/3(A)). Let 
{«!,..., be the distinct values of /x. Let Gk = {i ■ V'i = ^k, 1 < * < n}, 1 < k < K. 
Then {Qi, ..., constitutes a partition of {1,..., n}. 

An important question is which penalty function should be used here. The Li penalty 
with p^(t, A) = At applies the same thresholding to all pairs \fii — Pj\. As a result, it leads to 
biased estimates and may not be able to correctly recover the subgroups. This is similar to 
the situation in variable selection where the lasso tends to over-shrink large coefficients. In 
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our numerical studies, we found that the Li penalty tends to either yield a large number of 
subgroups or no subgroup on the solution path. Hence, a penalty which can produce unbiased 
estimates is more appealing. This motivates us to us e the conca v e pen alties including the 
smoothly clipped absol ute dev i ation penalty (SCAD, iFan and Lil (120011) ) and the minimax 
concave penalty IMCP. IZhand fj2010l) b These penalties are asymptotically unbiased and are 
more aggressive in enforcing a sparser solution. Thus, they are better suited for the current 
problem, since the number of subgroups is usually much smaller than the sample size. 

The MCP has the form 


A) = A / (1 - x/{-i\))+dx, 7 > 1, 


and the SCAD penalty is 


A) = A / min{l, (7 - x/A )+/(7 - l)}dx, 7 > 2 , 


where 7 is a parameter that controls the concavity of the penalty functions. In particular, 
both penalties converge to the Li penalty as 7 —)■ cxo. Here and in the rest of the paper, we 
put 7 in the subsc ript to indicate th e dependence of these penalty functions on it. Following 


Fan and Li ( 2001 ) and 


Zhand (120lOf) . we treat 7 as a fixed constant. These concave penalties 


enjoy the sparsity as the Li penalty that it can automatically yield zero estimates. More 
importantly, it has the unbiasedness property in that it does not shrink large estimated 
parameters, so that they remain unbiased in the iterations. This property is particularly 
essential in the ADMM algorithms since the biases in the iterations may signihcantly affect 
the search for subgroups. 


3 Computation 

It is difficult to compute the estimates directly by minimizing the objective function (13|) 
due to the fact that the penalty function is not separable in /rds- We reparameterize the 
criterion by introducing a new set of parameters rjij = pj — fij. Then the minimization of 
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(]3|) is equivalent to the constraint optimization problem, 



subject to /ij — fij — rjij = 0 , 


( 4 ) 


where 77 = < j}- By the augmented Lagrangian method (ALM), the estimates of the 

parameters can be obtained by minimizing 



where the dual variables v = {vij^i < ]} are Lagrange multipliers and 'd is the penalty 
parameter. We compute the estimators of (/x, f3, 77, v) through iterations by the ADMM. 

It is noteworthy that by using the concave penalties, although the objective function 
L{^, I3, 77 , v) is not a convex function, it is convex with respect to each rjij when 7 > I/t? for 
the MCP penalty and 7 > !/'& + 1 for the SCAD penalty. Moreover, for given [n, f3,r),v), 
the minimizer of L{fi, f3, 77, v) with respect to r]ij is unique and has a closed-form expression 
for the Li, MCP and SCAD penalties, respectively. Specihcally, for given (/x,/3, 77 , t;), the 
minimization problem is the same as minimizing 


2 ('^7 Vij) P'filVijli 


( 6 ) 


with respect to rjij, where Sij = fii — fij + ^Vij. Hence, the closed-form solution for the Li 
penalty is 


rjij = ST((5y, A/t?), 


(7) 


where ST(t, A) =sign(t)(|t| — A)+ is the soft thresholding rule, and (x)+ = x if x > 0, and 
(a;)+ = 0 otherwise. For the MCP penalty with 7 > l/'d, it is 



For the SCAD penalty with 7 >l/'d-|-l, itis 


ST(5,„A/79) 


if l^ijl < X + X/'d 
if A -|- A/t? < \5ij\ < 7 A 
if \6ij\ > 7 A 


) ST(7j,7A/((7-l)»^)) 

Sij 


(9) 
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3.1 Algorithm 

We now describe the computational algorithm based on the ADMM for minimizing the 
objective function ([3]). It consists of steps for iteratively updating and v. Denote 

the L 2 norm of any vector a by ||a||. The main ingredients of the algorithm are as follows. 

First, for a given (77, v), to obtain an update of pi and f3, we set the derivatives 
dL^n, l3, 77 , v) / dfj, and /3, 77 , v) / d/3 to zero, where 

L(/x,/3,77,t;) = (7/i-pi-x7/3)V^y'. .{(ei-ej)^/x-r/y+79"Vy}^ + C' 

= ^ llM-y + X/3||V ^ ||A77-77+79"^t;||%a (10) 

Here C is a constant independent of fj, and f3, y = ( 7 / 1 ,..., yn)'^, X = (x^, ..., x^)"^, e* is the 
ith unit n X 1 vector whose ith element is 1 and the remaining ones are 0, and A= {(e* — 
ej),i < Thus, for given 77 ^"*^ and at the m}^ step, the updates and 

which are the minimizers of /3,r]^'^\v^"^'>), are 

where Qx = X(X'^X)~^X"'", and 

j^im+i) ^ (x^X)~^X'^(y - 

We further can derive I+'dA'^A=(l + 77.79)1—'dll"'". 

Second, the update of rjij at the {m + 1)'^'^ iteration is obtained by the formula given in 
(ED, (ED and do]), respectively, by the Lasso, MCP and SCAD penalties with replaced by 

Am+l) _ (m+1) _ (m+1) , o-l M 

Uj^j flj “T 1 / Uj^j 

Finally, the estimate of Vij is updated as 

Based on the above discussion, the algorithm consists of the following steps: 

Step 1. Find initial estimates from least squares regression by letting /i* = p for all 
i. Let the initial estimates = y — X/3*^°\ 77 ^°^ = — 77 ^°^ and = 0. 

Step 2. At iteration 777 + 1, compute ^ methods 


described above. 


Step 3. Terminate the algorithm if the stopping rule is met at step m + 1. Then 
(^(m+i)^ estimates (^,/3 ,t 7, 0). Otherwise, we go to 

Step 2. 

Remark 1. We track the progress of the ADMM based on the primal residual = 

^^(m+i)_T^(m+i) _ g|.Qp algorithm when is close to zero such that || < e 

for some small value e. 

Remark 2. This algorithm enables us to have %j = 0 for a large A. We put yi and yj 
in the same group if %j = 0. As a result, we have K estimated groups ^i,..., and let the 
estimated intercept for the group be 0*’ where \Qk\ is the cardinality 

of Gk- 


3.2 Convergence of the algorithm 

We next consider the convergence properties of the ADMM algorithm. 

Proposition 1. The primal residual r*^™) = and the dual residual = 

'0 of the ADMM satisfy thatlimm^oo ||r^”*)|p = 0 andlirnm^-oo ||s^”*)|p = 0 

for both of the MCP and SCAD penalties. 

The proof of this result is given in the online supplement. Proposition [T] shows that the 
primal feasibility and dual feasibility are achieved by the algorithm. Therefore, it converges 
to an optimal point. This optimal point may be a local minimum of the objective function 
when a concave penalty function is applied. 

4 Theoretical properties 

In this section, we study the theoretical properties of the proposed estimator based on 
concave penalty functions. Specifically, we derive the order requirement of the minimum 
difference of signals between groups in order to recover the true groups and the oracle 
property that under some regularity conditions the oracle estimator is a local minimizer of 
the objective function with a high probability. Let M.g be the subspace of i?”, defined as 

A4g = {/X gR” : pLi = /ij, for any i,j^GkA^k< K}. 
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For each /x ^Mg, it can be written as /x = Za, where Z = {zik\ is the n x K matrix 
with Zik = 1 for i E Qk and Zik = 0 otherwise, and ck is a iF x 1 vector of parame¬ 
ters. By matrix calculation, we have D = Z^Z =diag(|^i|,..., \Qk\), where \Qk\ denotes 
the number of elements in Qk- Dehne \Qrmn\='^^'^i<k<K\Gk\ and |^max| = niaxi<fc<i^ l^^l- 
Let X = (X^,... Xp), where X^ is the jth column of X. Denote 

p(t) = and p(t) = p'(|t|)sgn(t). 

For any vector ^ = (Ci,..., Cs)"*" G ■, denote ||C|loo ~ niaxi</<s |(^;|. For any symmetric 
matrix A^xs, denote its L 2 norm as || A|| = max^g/js ||^||=i || A^|l, and let Amin(A) and A mav (A) 
be the smallest and largest eigenvalues of A, respectively. For any matrix A = 
denote || A||^ = maxi<j<s lApI- We introduce the following conditions. 

(Cl) Assume ||Xj|| = for 1 < j < p, Amin[(Z, X)'^( Z, X)] >Ci |^min|, and ||X||^ < C 2 P 
for some constants 0 < Ci < 00 and 0 < C 2 < 00 . 

(C2) p^{t,X) is a symmetric function of t, and it is non-decreasing and concave in t for t 
G [0, 00 ). p(t) is a constant for all t > a\ for some constant a > 0, and p(0) = 0. p'(t) 
exists and is continuous except for a finite number of t and p'(O-l-) = 1. 

(C3) The noise vector e = (ei,..., has sub-Gaussian tails such that P(|a'^e| > ||a||x) < 
2exp(—Cix^) for any vector a gP” and x > 0, where 0 < Ci < cxo. 

Conditions (C2) and (C3) are common assumptions in high-dimensional settings. The 
concave penalties such as MCP and SCAD satisfy Condition (C2). In the literature, it is 
commonly assumed that the smallest eigenvalue of the design matrix is bounded by Cin, 
which may not hold for (Z,X)"'"(Z,X). For instance, by letting Z"'"X = 0 and assuming 
Amin(X'^X) =Cn, we have 

Amin[(Z,X)'^(Z,X)] >min{Amin(D),Amin(X'^X)} =min(|^min| ,Cn), 
and l^minl < n/K. Therefore, we let the smallest eigenvalue in Condition (Cl) be bounded 

by C*i l^min I • 
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When the true group memberships ^i,..., Qk are known, the oracle estimators for /j, and 
(3 are 

(/^°^ 3°'') = arg min ^||y -/x - X/3||^ (11) 

^iGMg,f3eIiP Z 

and correspondingly, the oracle estimators for the common intercepts ck and the coefficients 
f3 are given by 

3°"') = arg min l-\\y - Zcx - X(3\f 

,l3eRP 1 

= [(Z,X)^(Z,X)]-3z,X)V 

Let oP = (a^, A; = 1,..., X)"*", where Q!° is the underlying common intercept for group 
Let be the underlying regression coefficient. 

Theorem 1. Suppose conditions (C1)-(C3) hold. If K = o{n), p = o{n), and 

I Grain I > y^l^^IpjnAo^, 


we have with probability at least 1 — 2{K + p)n 


- (3Y) 


0\T\T 


^ 0n) 


where 


<pn = Cl + p l^minl ^ \/ulogH. 

Moreover, for any vector G we have as n ^ oo, 

- a^)^, (X - YY ^ N{f), 1), 


( 12 ) 


(13) 


(14) 


where 

a„(aj = a{aj[( Z, X)^(Z, X)]-'a„y^'. (15) 

The proof of this theorem is given in the online supplement. 

Remark 3. Since \Grma\ "iSn/K ,hy the condition |^min| S> \/ {K + p)n logn, K and p 
must satisfy {K + p) = o{y/n{\ogn)~^), and hence K = oY^^ilogn)' ^/^). By letting 
l^minl = dn/K for some constant 0 < <5 < 1, the bound (IT^ is K + pY/logn/n. 

Moreover, when K and p are fixed numbers, the bound ( [T^ is C*^/\ogn/n for some constant 
0 < (7* < cx). 
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Let 


^T7. 


mm 


i&Qk 


l/^° - ^A\ = l"fc “ " 


k^k' 


be the minimal difference of the common values between two groups. 


Theorem 2. Suppose the conditions in TheoremUlhold. If bn > aX and A 3> 4>n, where cfn 
is given in m, then there exists a local minimizer (/r(A)^,/3(A)^)'^ of the objective function 
Qn{l-i, (3; X) given in ^ satisfying 

^ ((m(A)^,3(A)^)^= ^ 1- 

The proof of this theorem is given in the online supplement. 

Remark 4. The above result holds given that 3> (fn- As discussed in Remark 3, 
when K is a hnite and hxed number and |^min| = dn/K for some constant 0 < 5 < 1, 
bn S> C*yd\ogriJn for some constant 0 < C* < cxo. Moroever, Theorem [2] shows that 
the oracle estimator (/3 is a local minimizer (^(A)"'",/3(A)"'")"'" of the objective 

function with probability approaching 1. Let ck(A) be the distinct values of fi{X). Also 
a°^ consists of the distinct values of /2°^. By the oracle property in Theorem [21 we have 
F{S(A) = a°^} 1. This result together with the asymptotic normality given in Theorem 

indirectly leads to the asymptotic distribution of (a(A)"'",/3(A)"'")"'" presented in the following 
corollary. 


Corollary 1. Under the conditions in Theorem 0 we have for any vector a^ G , as 
n —)■ oo , 


a-^(a.)aJ((a(A) - oP)\ (3(A) - (3^)T ^ Ar(0,1), 


with an(an) given in m- As a result, we have for any vectors a„i G and a „2 £ R^, as 
n —)■ cx), cr“/(a„i)a)()(cK(A) — a") —)■ iV(0,1) and cr“2^(a„2)aj2(/3(A) — /3°) —)■ iV(0,1), where 

nl/2 


(TniM = (T aJi{Z^Z-(Z^X)(X^X)-3x'^Z)}-'a„i 


<Jn2{an2) = a - (X^Z)(Z^ Zy\Z^X.)}-^an2 


T ry\-l/rvTvNT “1 


nl/2 


Remark 5. The asymptotic distribution of the penalized estimators provides a theoreti¬ 
cal justihcation for further conducting statistical inference about subgrouping. By the results 
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in Corollary [H for given a„i G and a „2 G R^, 100(1 — a)% confidence intervals for 
and a((’ 2 / 3 ° are given as ai^-^OL{\)±Za/ 2 ^ni{'^i) and a^ 2 PW^^a/ 2 ^n 2 {^ 2 )■, respectively, where 
Zai 2 is the (1 —a/ 2)100 percentile of the standard normal, and a„i(a„i) and an 2 {sLn 2 ) are esti¬ 
mates of cT„i(a„i) and a„ 2 (an 2 ) with estimated by = {n — K—p)~^ ^ 

where K is the number of distinct values in /i(A). 


5 Simulation studies 


In this section, we conduct simulation experiments to investigate the numerical performance 
of our proposed estimators. 


We use the modihed Bayesian Information Criterion (BIC) fjWang et all (120071 )) for high¬ 


dimensional data settings to select the tuning parameter by minimizing 

BIC = log[^, + + 

^—^*=1 n. 


(16) 


where Cn is a positive number which ca i i depe nd on n. Wh e n C„ = 1, the modihed BIC 
reduces to the traditional BIC fISchwarzl (ll978fB . IWang et all (120091) used Cn = log(log(d)) 
in their simulation study when the number of predictors which is d diverges with sample 
size. In this paper, we adopt the same strategy and let = clog(log((i)), where d = n + p 
and c is a positive constant. In our analysis, we select A by minimizing the modihed BIC 
and use a hxed value for 'd and 7 . 

Example 1. We simulate data from the model 

Pi = + f3+ei,i = I,-■ ■ ,n, (17) 


where Xj = [xn,... are generated from the multivariate normal distribution with 

mean 0, variance 1 and an exchangeable correlation p = 0.3 and the error terms e* are from 
independent iV(0,0.5^). We simulate (3 = , (3^)^ from independent Uniform[0.5,1]. 

We generate pi from two diherent values —a and a with equal probabilities, i.e., we generate 
them from the distribution: p{pi = —a) = p{pi = a) = 1 / 2 , so that there are two intercepts 
CDi = —a and 02 = «. In our simulation studies, we take diherent values of a for illustration 
of our proposed method. It is noteworthy that for smaller value of cc, it is more difficult to 
identify the two groups. 
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In our analysis, we choose to fix = 1 and 7 = 8 . We compare the performance of the 
estimators with the ADMM algorithm by using the two concave penalties (MCP and SCAD) 
and using a weighted Li penalty 


P'y (I pi Pj I) I Pi pj I) 


which requires specification of the weights Uij. As discussed in IChi and Langel (120011 ). the 


choice of the weights can dramatically affect the quality of the results in cluster analysis. In 
the regression context such as in our study, it is even more challenging to select the weights. 

For the Li penalty, we let the weight be = exp{—(j){yi — i/jY) which is a Gaussian 
kernel defined on the distance of two points. The constant 0 is nonnegative. When 0 = 0, 
it corresponds to the Lasso penalty. Note that it is unclear what weights we need to apply 
to obtain optimal results. We here use the Gaussian kernel as the weight to illustrate this 
point by using different values for 0 . 

Figured] displays the solution paths for the means (/ii,...,/i„) against A values by using 
MCP and SCAD, and the Li penalties with 0 = 0,0.5,1,2, respectively, based on one 
sample with n = 100 and a = 1. We observe that the MCP and SCAD have similar solution 
paths as shown in Figure dl For these two penalties, the estimated values for converge 
to two different values around —1 and 1 which are the true values for the intercepts of the 
two groups, when A reaches certain value (around 0.38 for both MCP and SCAD). They 
eventually converge to one value when A exceeds 0.6. The Li penalty, however, shows a 
different solution path from MCP and SCAD, and the solution paths look quite differently 
for different values of 0 , so that the choice of weights can dramatically affect the estimation 
results. When 0 = 0 which is the LASSO penalty, we see that the estimated values for /ids 
converge quickly as the A value increases until they converge to a common point around 0 
when A reaches 0.035. As a result, it cannot effectively identify the groups of the /x value. 
By looking at the plots for 0 = 0.5,1, 2, we observe that as the 0 value becomes larger, the 
estimated values converge to one point more slowly. 

Next we conduct the simulations by selecting A via minimizing the modified BIG given 
in (d6H . Recall that we let = clog(log(n + p)), where n + p is the number of components 
in and /3, and c is a positive constant. We use different c values by letting c = 5,10 


14 





Figure 1: Solution paths for the means (/ii,. 
and Li penalties, respectively, in Example 1. 

MCP 




0.00 0.01 0.02 0.03 0.04 0.05 
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in our estimation procedure. We consider different values for a by letting a = 1, 1.5,2, so 
that the difference of the true common values between the two groups varies from 2 to 4. 
Table [U reports the mean, the median and standard error (s.e.) of the estimated number of 
groups K by the MCP, SCAD and Li methods with 0 = 1 and 2 based on 100 simulation 
realizations with n = 100. Moreover, to study the estimation accuracy, in Table |2] we report 
the average value and the standard error shown in the parentheses of the square root of the 
mean squared errors (MSE) for the estimated values of /j. and f3 for the MCP, SCAD and 
Li estimators and the oracle estimator given in flTT]) . The square roots of the MSE for fj, 
and f3 are, respectively, dehned as ||^ — /x|| /y/n and \ \ (3 — /3||/y^ for each realization. 


Table 1: The mean, median and standard error (s.e.) of K by the MCP, SCAD and Li 
methods with 0 = 1.0 and 2.0 based on 100 realizations with n = 100 in Example 1. 


c 

cx 

1.0 

1.5 

2.0 



mean 

median 

s.e. 

mean 

median 

s.e. 

mean 

median 

s.e. 


MCP 

2.57 

2.00 

0.90 

2.41 

2.00 

0.93 

2.10 

2.00 

0.44 

5.0 

SCAD 

2.58 

2.00 

0.96 

2.37 

2.00 

0.90 

2.18 

2.00 

0.63 


Li(0 = 1.0) 

1.76 

1.00 

0.99 

2.71 

3.00 

0.88 

2.50 

2.00 

0.82 


Li(0 = 2.O) 

3.03 

3.00 

1.16 

3.13 

3.00 

1.19 

3.25 

3.00 

1.00 


MCP 

2.10 

2.00 

0.33 

2.04 

2.00 

0.20 

2.01 

2.00 

0.11 

10.0 

SCAD 

2.11 

2.00 

0.35 

2.04 

2.00 

0.20 

2.02 

2.00 

0.14 


Li(0 = 1.0) 

1.40 

1.00 

0.65 

5.10 

4.00 

3.00 

3.75 

3.00 

1.60 


Li(0 = 2.O) 

2.29 

2.00 

0.78 

3.03 

3.00 

1.02 

3.25 

3.00 

1.00 


In Table [H for both MCP and SCAD methods we observe that the median value of K 
among the 100 replications is 2 for all cases, which is the true number of groups in our 
model, and the mean values are close to 2 for different values of a. For larger value of a, it 
is easier to detect the subgroups, so that correspondingly we observe that the mean values 
of K are closer to 2 for larger a. The MCP and SCAD can identify the groups for both 
values of c, although they perform better with c = 10 by having smaller standard errors. The 
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Table 2: The mean and standard error (s.e.) shown in parentheses of the square root of the 
MSE for the estimated values of n and f3 for the MCP, SCAD and Li penalty estimators 
and the oracle estimators with 0 = 1.0 and 2.0 based on 100 realizations with n = 100 in 
Example 1. 





/3 

c 

OL 

1.0 

1.5 

2.0 

1.0 

1.5 

2.0 


MCP 

0.409 

0.246 

0.132 

0.043 

0.076 

0.062 



(0.108) 

( 0.192) 

(0.151) 

(0.034) 

(0.045) 

(0.038) 

5.0 

SCAD 

0.414 

0.240 

0.158 

0.091 

0.075 

0.065 



(0.116) 

(0.190) 

(0.168) 

(0.036) 

(0.044) 

(0.040) 


Li(0 = 1.0) 

0.874 

0.370 

0.185 

0.118 

0.084 

0.066 



( 0.202) 

(0.237) 

(0.173) 

(0.040) 

(0.047) 

(0.036) 


Li(0 = 2.O) 

0.637 

0.274 

0.167 

0.106 

0.076 

0.064 



(0.226) 

(0.180) 

(0.153) 

( 0.040) 

(0.041) 

(0.035) 


MCP 

0.407 

0.230 

0.154 

0.086 

0.069 

0.062 



(0.139) 

(0.178) 

(0.164) 

(0.035) 

(0.034) 

(0.030) 

10.0 

SCAD 

0.409 

0.234 

0.155 

0.086 

0.069 

0.061 



(0.138) 

(0.178) 

(0.163) 

(0.034) 

(0.035) 

(0.030) 


Li((/) = 1.0) 

0.946 

0.265 

0.203 

0.121 

0.075 

0.069 



( 0.138) 

(0.142) 

(0.169) 

( 0.039) 

(0.038) 

(0.038) 


Li(0 = 2.O) 

0.769 

0.287 

0.167 

0.113 

0.078 

0.064 



(0.215) 

(0.210) 

(0.153) 

( 0.039) 

(0.046) 

( 0.035) 
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Li penalties with both values for 0 in general have worse performance than the MCP and 
SCAD penalties. They have mean and median values for K further away from 2 and larger 
standard errors. Moreover, the performance of the Li penalty is not stable. The Li penalty 
with 0 = 1 tends to select less than two groups for a = 1.0 and more than two groups for 
a = 1.5, 2.0, while the Li penalty with 0 = 2 tends to select more groups in general. For 
a = 1.5 and 2.0 and c = 5.0, the Li penalty with 0 = 1 performs better than the Li penalty 
with 0 = 2 by having K values closer to two and smaller standard errors, but for other 
cases the Li penalty with 0 = 2 seems to perform better. Thus, we see that different weights 
applied to the Li penalty may signihcantly affect the performance of the resulting estimator, 
and there is no clear rule on what weight to be used in the general situation. Table |2] shows 
that the MCP and SCAD methods have smaller MSE values than the Li penalty methods 
in general since they have more accurate selection results and produce less biased estimates. 

To evaluate the asymptotic normality established in Corollary [H Table [3] lists the em¬ 
pirical bias (Bias) for the estimates of the two intercepts ai and Q! 2 , and it also presents 
the average asymptotic standard error (ASE) calculated according to Corollary [1] and the 
empirical standard error (ESE) based on 100 replications for the MCP and SCAD meth¬ 
ods with c = 10 as well as the oracle estimator (ORACLE). The biases are around zero 
for all cases. Moreover, we observe that the asymptotic standard errors for the MCP and 
SCAD methods are similar to those for the ORACLE estimator. This result supports our 
asymptotic normality result in Corollary [1] 

Lastly, we conduct inferences on the difference between groups. Table |4] presents the 
average p-values for testing 'Ho : ai = a 2 based on the 100 simulation realizations. We use 
(T„i(a)“^(«i(A) — a 2 (A)), a = (1, —1), as the test statistic which has the asymptotic normal 
distribution given in Corollary [H and the estimates ai(A) and a 2 (A) are obtained by the 
MCP and SCAD methods with c = 10. We obtain the p-values close to zero for all cases, so 
that the difference between the groups is further conhrmed by the inference procedure. 

Example 2. We simulate data from model flTT)) with the predictors, the error terms and 
the coefficients /3 generated from the same distributions as given in Example 1. We simulate 
jUi from three different values —2, 0, 2 with equal probabilities. We use the modihed BIC to 
select the tuning parameter A by letting Cn = 5log(log(n-l-p)). Figure [2] shows the boxplots 
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Table 3: The empirical bias (Bias) for the estimates of ai and 02 , and the average asymptotic 
standard error (ASE) calculated according to Corollary [T] and the empirical standard error 
(ESE) based on 100 replications for the MCP and SCAD methods and the oracle estimator 
(ORACLE) with c = 10 in Example 1. 




a = 

1.0 

a = 

1.5 

a 

= 2.0 



Q-l 

0.2 

(y.\ 

tt2 

tti 

0-2 


Bias 

0.037 

-0.066 

0.031 

-0.055 

0.065 

-0.083 

MCP 

ASE 

0.071 

0.070 

0.072 

0.071 

0.075 

0.074 


ESE 

0.104 

0.117 

0.085 

0.092 

0.085 

0.092 


Bias 

0.040 

0.069 

0.036 

-0.060 

0.067 

-0.087 

SCAD 

ASE 

0.071 

0.070 

0.072 

0.071 

0.075 

0.074 


ESE 

0.103 

0.119 

0.085 

0.094 

0.084 

0.094 


Bias 

-0.009 

-0.005 

-0.010 

-0.005 

-0.010 

-0.005 

ORACLE 

ASE 

0.072 

0.072 

0.072 

0.072 

0.072 

0.072 


ESE 

0.070 

0.067 

0.070 

0.067 

0.070 

0.067 


Table 4: The average p-values for testing "Ho : «i = 02 based on the 100 simulation realiza¬ 
tions with the estimates Si (A) and S 2 (A) obtained by the MCP and SCAD methods with 
c = 10 in Example 1. 


OL 

1.0 

1.5 

2.0 

MCP 

< 0.001 

< 0.001 

< 0.001 

SCAD 

< 0.001 

< 0.001 

< 0.001 
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of the estimated number of subgroups K and the square root of the MSE for the estimated 
values of and (3, respectively, by using MCP, SCAD and Li penalty with 0 = 1,2 methods 
based on 100 simulation realization with n = 100. In the hrst plot, we observe that for the 
MCP and SCAD methods, the median value for K is 3, which is the true number of groups 
in our model. For some replications, they select more groups than three. For the Li penalty 
with 0=1, the median value for iC is 3 as well. However, some replications have more than 
three and others have less than three for the K value. Moreover, for this example, the Li 
penalty with 0 = 2 tends to select more groups in all replications. The other two plots show 
that the MCP and SCAD have much smaller MSE values than the two Li penalty methods. 

Example 3. We generate data from a homogeneous model given asyi = p+xf /3+ei, i = 
1,..., 100. The predictors, the error term and the coefficients are simulated in the same way 
as in Example 1. Let /i = 2. We £t the heterogeneous model ([T]) by using our proposed 
method. In practice, we choose the value of A by the modihed BIC method as illustrated in 
Examples 1 and 2. In this example, for illustration of our penalization estimation and the 
subsequent inference method for subgroup identihcation in the homogeneous model, we use 
a set of different values for the tuning parameter A. For small values of A, we expect to have 
more identified groups. We further conduct inference on heterogeneity between groups by 
using the asymptotic normality in Corollary [H 

To test on heterogeneity, we formulate the hypothesis that "Hq ^ = (lA'I — 1)”^ 

, where ai is the intercept for the largest group, so that we test on the difference of the 
intercept for the largest group and the average intercept for other groups. For demonstration 
of our inference procedure applying to the homogeneous model, we choose a set of small 
values A = (0.15, 0.20, 0.25), so that more than one groups are identihed by the penalization 
procedure. For small values of A, since subgroups with small sizes may be identihed, we 
adjust to estimate by = (n — 2 — p)~^ ~ i where pi = Si for i ^ Gi 

and Pi = {\K\ — otherwise. Figure [3] shows the boxplots of the p-values for 

the hypothesis testing based on the 100 simulation realizations for different values of A. The 
estimates of the intercepts are obtained by the MCP and SCAD methods, respectively. We 
observe that the median values of the p-values are large in general. 

Example 4. Case 1. We simulate data from the same data generating process as Exam- 
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Figure 2: Boxplots of the estimated number of subgroups K and the square root of the MSE 
for the estimated values of ^ and /3, respectively, by using MCP, SCAD and Li with 0 = 1,2 
methods based on 100 simulation realizations with n = 100 in Example 2. 



SCAD MCP LI, phi=1 L1, phi=2 



SCAD MCP L1,phi=1 L1,phi=2 



SCAD MCP LI, phi=1 L1, phi=2 


pie 2, so that the data are generated from three groups with the same size (balanced groups). 
In this example, we aim to compare the performance of cluster analysis by using different 
penalties including MCP, SCAD, and truncated Li as well as by using the Gaussian mix¬ 


ture model-based clustering algorithm from the R package of MCLUST flFralev and Rafterv 
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Figure 3; Boxplots of the p-values for the hypothesis testing in Example 3 based on the 100 
simulation realizations for different values of A. 


MCP 


SCAD 



lambda=0.15 lambda=0.2 lambda=0.25 



lambda=0.15 lambda=0.2 lambda=0.25 


(120021) ). In our regression setting, we need to apply MCLUST to i/i — for cluster anal¬ 


ysis. One simple way is to obtain the estimate /3 of /3 by the ordinary least squares (OLS) 
hrst, and then apply the MCLUST to the pseudo observations yi — x)'"/3, which is adopted 
in our numerical analysis. 

For the penalized methods, we apply the same iterative algorithm as described in Section 
lO.ll to obtain the parameter estimates by using different penalties. The same BIC method is 
applied to choose the tuning parameter as described in Example 2. It is worth noting that the 
rest steps remain the same except for the estimation of rjij, which needs some modihcations 
due to the use of different penalties. Specihcally, for the truncated Li penalty which has 
the form p(|t|, A;r) = Amin(|t|;r), where r is the thresholding parameter, the estimate of 
rjij is obtained by minimizing h{rjij) = — 77 ^)^ -|- A minf hjd: r). We then apply the 

difference of convex programming technique as given in fjShen and Huand (120101) ) to obtain 
the minimizer of h{r]ij). In this algorithm, the function hirjij) needs to be decomposed into 
difference of two convex functions hi{rjij) — h 2 {rjij)^ where hi{rjij) = — Tjij)^ -|- X\'r]ij\ and 

^ 2 (^ 77 ) = ~ '^)+- This enables us to approximate h{r]ij) by an upper convex function 
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at iteration m + 1 which results in 




^m+1) 


^m+1) I 

ij \ 




x/^ 


^m+1) i^m+l) 




if |^<;>| > r 
otherwise 


One important evaluation criterion for clustering methods is based on their ability to 
reconstruct t he true underlying cluster structure. We, therefore, use the Rand Index measure 
flRandl (jl97l|)) to evalute the accuracy of the clustering results. The Rand Index is viewed 
as a measure of the percentage of correct decisions made by the algorithm. It is computed 
by using the formula: 


RI = 


TP + TN 


TP + FP + FN + TN’ 

where a true positive (TP) decision assigns two observations from the same ground truth 
group to the same cluster, a true negative (TN) decision assigns two observations from 
different groups to different clusters, a false positive (FP) decision assigns two observations 
from different groups to the same cluster, and a false negative (FN) decision assigns two 
observations from the same group to different clusters. The Rand Index lies between 0 and 
1. Higher values of the Rand Index indicate better performance of the algorithm. 

Table [5] presents the mean and standard error (s.e.) of iP, the square root of the MSE 
(SMSE) for the estimated /x and the clustering ac curacy (Accuracy) by d ifferent methods. 
For the truncated Li, by taking the same strategy as Shen and Huang f 2nifll ). we use different 
values r = 0.5,1.0,1.5 for the thresholding parameter. In the MCLUST column, it shows the 
results by using the MCLUST package with the number of groups selected by the BIC method 
which is the default method in the MCLUST package and is widely used for determining the 
number of clusters in practice. In the MCLUST-MCP column, it shows the results by using 
the MCLUST package with the number of groups determined by our proposed penalized 
approach with MCP penalty. 

From Table [H we observe that the proposed concave fusion penalized methods, MCP and 
SCAD, have better performance than other methods. They have higher clustering accuracy 
rates and smaller SMSE values for Jl than others. This result is further reflected by the 
boxplots in Figure |4] of accuracy rates for the MCP, SCAD, truncated Li with r = 1.0, and 
MCLUST methods. For the truncated Li, it has the best performance at r = 1.0 among 
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Table 5: The mean and standard error (s.e.) of K and the square root of the MSE (SMSE) 
for the estimated /x as well as the clustering accuracy (Accuracy) by different methods based 
on 100 realizations with n = 100 for Case 1 of Example 4 with balanced groups. 




MCP 

SCAD 

Truncated Li 

T = 0.5 T = 1.0 r = 1.5 

MCLUST 

MCLUST-MCP 

K 

mean 

3.570 

3.600 

6.930 

3.960 

2.390 

2.400 

— 


s.e. 

0.671 

0.696 

0.956 

0.887 

0.737 

0.711 

— 

SMSE of /X 

mean 

0.589 

0.585 

0.597 

0.605 

0.963 

0.791 

0.607 


s.e. 

0.157 

0.154 

0.158 

0.164 

0.195 

0.380 

0.134 

Accuracy 

mean 

0.897 

0.892 

0.829 

0.873 

0.707 

0.777 

0.864 


s.e. 

0.059 

0.057 

0.066 

0.064 

0.112 

0.193 

0.058 


the three different values for r. Moreover, the three penalized methods, MCP, SCAD and 
truncated Li with r = 1.0, can identify the cluster membership more correctly than the 
MCLUST method by observing higher accuracy rates. The MCP improves the accuracy 
rate by 15.4% compared to the MCLUST. It is worth noting that in order to apply the 
Gaussian mixture model-based method, how many clusters to be used is always crucial. For 
the MCLUST-MCP, instead of using the BIC, we use our proposed penalized MCP approach 
to determine the number of clusters and then apply the MCLUST, we see that the accuracy 
rate is improved compared to the MCLUST with the BIC method. This result indicates 
that our proposed concave penalized method also provides a possible tool to determine the 
number of clusters for the Gaussian mixture model-based method. 

Case 2. In this setting, we generate data from three groups with different sizes (un¬ 
balanced groups). We consider two simulation designs: Design 1: /ij’s are generated from 
three different values —2, 0, 2 with probabilities 0.2, 0.3, 0.5, respectively, and Design 2: /Xj’s 
are generated from —2, 0, 2 with probabilities 0.1, 0.3, 0.6, respectively. Other terms are 
simulated according to the same setting as Case 1. Table |6] presents the mean and standard 
error (s.e.) of K, the square root of the MSE (SMSE) for the estimated /x and the cluster- 
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Figure 4: Boxplots of the clustering accuracy for the MCP, SCAD, truncated Li and 
MCLUST based on the 100 simulation realizations in Case 1 of Example 4. 



MCP SCAD truncated LI MCLUST 


ing accuracy (Accuracy) by the MCP, SCAD, truncated Li, MCLUST and MCLUST-MCP 
based on 100 realizations. We see that for the MCP, SCAD and truncated Li methods, the 
performance for the two unbalanced designs is comparable to that for the balanced design 
in Case 1. Again the MCP and SCAD outperform the other methods. The performance 
of MCLUST-MCP shows improvement over MCLUST. For the MCLUST, the estimated 
number of groups K decreases as the design becomes more unbalanced. The smallest group 
is not successfully identihed for most replications. For the penalized method, however, the 
K values remain similar for different designs. Hence, the MCLUST may be more sensitive 
to cluster sizes based on these simulation results. 


6 Empirical example 


In this section, we use the Cleveland Heart Disease Dataset to illustrate our method. This 


dataset is available at the UCI machine learning repository. 


measurements on 297 individuals. As described in 


'he dataset has 13 clinical 


Lauer et all (119991 ). the maximum heart 


rate achieved (thalach) variable is related to cardiac mortality. In addition, some categorical 
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Table 6: The mean and standard error (s.e.) of K and the square root of the MSE (SMSE) 
for the estimated /x as well as the clustering accuracy (Accuracy) by different methods based 
on 100 realizations with n = 100 for Case 2 of Example 4 with unbalanced groups. 




MCP 

SCAD 

Truncated Li 

r = 0.5 r = 1.0 r = 1.5 

MCLUST 

MCLUST-MCP 

Design 1 

K 

mean 

3.730 

3.660 

6.540 

3.870 

2.360 

2.380 

— 


s.e. 

0.670 

0.713 

1.049 

0.928 

0.659 

0.663 

— 

SMSE of y 

mean 

0.561 

0.556 

0.585 

0.577 

0.893 

0.771 

0.592 


s.e. 

0.126 

0.130 

0.127 

0.146 

0.135 

0.309 

0.124 

Accuracy 

mean 

0.890 

0.891 

0.822 

0.872 

0.733 

0.792 

0.846 


s.e. 

0.048 

0.048 

0.051 

0.058 

0.092 

0.152 

0.064 

Design 2 

K 

mean 

3.700 

3.730 

6.350 

3.960 

2.690 

2.230 

— 


s.e. 

0.717 

0.709 

0.880 

1.197 

1.473 

0.679 

— 

SMSE of /i 

mean 

0.488 

0.487 

0.522 

0.502 

0.852 

0.763 

0.579 


s.e. 

0.121 

0.120 

0.122 

0.127 

0.135 

0.220 

0.148 

Accuracy 

mean 

0.898 

0.899 

0.823 

0.877 

0.713 

0.793 

0.818 


s.e. 

0.048 

0.047 

0.056 

0.054 

0.118 

0.123 

0.097 


variables are also used to check heart problems including chest pain type, exercise induced 
angina indicator, ST depression induced by exercise relative to rest, slope of the peak exercise 
ST segment, number of major vessels colored by fluoroscopy and the heart status (normal=3; 
hxed defect=6; reversible defect=7). We use the htted value of thalach as the response 
variable by projecting it onto the linear space spanned by the categorical variables. Our 
interest is to conduct subgroup analysis for the htted value of thalach as the response y 
after adjusting for the effects of the covariates: xi =age in years; X 2 =gender; x^ =resting 
blood pressure; x^ =serum cholesterol; x^ =fasting blood sugar indicator; and xq =resting 
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Figure 5: Density plot of the response variable after adjusting for the effects of the covariates 
for the empirical example. 



electrocardiographic results. 

We hrst plot the kernel density estimates of yi — f3 in Figure [5], where (3 is obtained 
from OLS estimation. Clearly, we see that after adjusting for the effects of the covariates, 
the distribution in Figure [5] still shows multiple modes. The heterogeneity may be caused by 
some unobserved latent factors. Hence, it is not suitable to ht a standard linear regression 
model with a common intercept by using the response and the predictors. Instead we £t the 
heterogeneous model Vi = yi + xT/3+ei,i = 1,... ,297, and we identify subgroups by our 
proposed ADMM algorithm. We select the tuning parameter by minimizing the modihed 
BIC in a certain range by following the same rule as given in Example 2 of Section [5l As a 
result, two major groups are identihed by both of the MCP and SCAD methods. We also 
conduct inference by testing the difference of the intercepts for the two identihed groups by 
using the asymptotic normality in Corollary [H and we find that the p-values are close to 
zero for both of the MCP and SCAD methods. 

In Table [7] we report the estimated coefficients /3, their standard deviations (s.d.) and 
the p-values for testing the signihcance of the coefficients by the proposed method with the 
MCP and SCAD pairwise fusion, and the OLS estimation by assuming a common intercept. 
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Table 7: The estimated values (est) for the coefficients f3, their standard deviations (s.d.) 
and the p-values for testing the signihcance of the coefficients by the OLS, MCP and SCAD, 
respectively. 




/3i 

f^2 


Pi 

P, 

P, 


est 

-0.345 

-4.120 

-0.028 

-0.008 

0.183 

-1.359 

OLS 

s.d. 

0.083 

1.534 

0.042 

0.0142 

2.031 

0.725 


p-value 

< 0.001 

0.007 

0.502 

0.563 

0.928 

0.061 


est 

-0.355 

-3.825 

-0.007 

-0.006 

0.628 

-1.849 

MCP 

s.d. 

0.040 

0.752 

0.021 

0.007 

1.016 

0.354 


p-value 

< 0.001 

< 0.001 

0.563 

0.558 

0.283 

< 0.001 


est 

-0.358 

-3.698 

-0.012 

-0.004 

1.091 

-2.129 

SCAD 

s.d. 

0.040 

0.743 

0.021 

0.007 

1.005 

0.351 


p-value 

< 0.001 

< 0.001 

0.558 

0.554 

0.278 

< 0.001 


The standard deviation for the MCP and SCAD methods is calculated by the asymptotic 
formula given in Corollary [H The age and gender variables show a strongly signihcant effect 
by these three methods with p-values close to zero, while resting blood pressure and serum 
cholesterol show a very weak effect by the three methods with large p-values. Moreover, 
by the MCP and SCAD methods, the effects of fasting blood sugar indicator and resting 
electrocardiographic results become more signihcant than the results by the OLS method. 
This result indicates that by recovering the hidden heterogeneous structure of the data, it 
helps us identify useful variables which may have effects on the response. We also calculate 
the coefficient of determination R'^, and obtain R^ = 0.667, 0.704 and 0.109 for MCP, 
SCAD and OLS methods. We see that taking into account the subgroup structure leads 
to a signihcant improvement of the model htting. Next we apply the Gaussian mixture 
model-based method to this dataset for cluster analysis. As described in Example 4 of the 
simulation section, we apply the MCLUST to the pseudo observations i/i — x^/3, where (3 is 
obtained from the OLS. As a result, two subgroups are identihed. For the real data, since 
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the true underlying cluster structure is unknown, we cannot use the external criterion, Rand 
Index measure, to evaluate and compare different methods. Instead, we use the internal 
criterion, the Davies-Bouldin index, to assess the quality of clustering algorithms, which is 
calculated by the formula: DB= K~^ ^ + crk')/d{ck,Ck')), where K is the 

estimated number of clusters, is the centroid of cluster x, is the average distance of 
all observations yi — xjf3 in cluster x to centroid Cx, and d{ck,Ck') is the distance between 
centroids Ck and Ck'- The clustering algorithm that has the smallest Davies-Bouldin index 
is considered the best algorithm based on this criterion. The Davies-Bouldin index values 
for MCP, SCAD and MCLUST are 0.469, 0.467, and 0.506, respectively, so that the MCP 
and SCAD outperform the MCLUST based on this criterion. 


7 Discussion 


The model (II]) is related to the Neyman-Scott models flNeyman and ScottI (119481 )). In the 
terminology of Neyman and Scott, the /ij’s in ([T]) are called incident parameters. In the 
literature, such parameters are usually treated as nuisance parameters, while the main in- 
terest lies in estim ating the common parameter such as {/3,cr^} in ([T]) based on panel data 
flLancasterl (120001) ). The problem we consider here is different and we use the /ids to repre¬ 
sent latent heterogeneity in the observations for the purpose of conducting subgroup analysis. 
Also we do not assume that panel data are available, so model ([T]) is not identifiable without 
a constraint on the parameter space such as the subgroup structure considered in the present 
paper. 

It is also possible to adopt a random effects model approach by taking the /ids in ([T]) 
as random variables from a mixture distribution. Then the estimation and inference can be 
carried out using a likelihood-based method. The main difficulty in applying this approach 
is that it requires specifying the number of subgroups, the parametric form of the mixture 
distribution, and an assumption on the error distribution. It is worth noting that the choice 
of the number of groups is always crucial in mixture model-based methods. Different meth¬ 


ods on this topic have been proposed in t 


le literature. Among them, the Bayesian model 


selection criteria (iFralev and RaftervI (119981) I are widely used, and the gap statistic proposed 
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in 


Tibshirani et all (120011) is also an important tool. Our proposed penalized method provides 


another possible approach to automatically estimate the number of groups with reliable the¬ 
oretical properties. By using the MCLUST, our simulation studies show that the clustering 
accuracy is improved by using the proposed penalized method to select the number of groups 
compared to the BIG. 

In our theoretical results, we allow p, the dimension of the regression parameter /3, to 
diverge with n, but require it to be smaller than n. For models and data with p > n, a. 
sparsity condition needs to be imposed on (3 and an additional penalty term to enforce the 
sparsity is required. Computationally, we can still derive an algorithm within the framework. 
However, much extra effort is needed to establish the theoretical properties of the estimators 
in this high-dimensional setting. This is an interesting and challenging technical problem 
and deserves further investigation, but is beyond the scope of this paper. 

The proposed method can be extended to other models including the generalized lin¬ 
ear models and regression models for censored survival data. Although these extensions 
appear to be conceptually straightforward, it is a nontrivial task to develop computational 
algorithms and establish theoretical properties in these more complicated models. 
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